A Simple, Straightforward and Effective Model for Joint Bilingual Terms Detection and Word Alignment in SMT
Authors
Abstract
Terms are pervasive in specialized domains, and term translation plays a critical role in domain-specific statistical machine translation (SMT). However, extracting term translation knowledge from parallel sentences is challenging because errors propagate through the SMT training pipeline. In this paper, we propose a simple, straightforward and effective model that mitigates this error propagation and improves the quality of term translation. The model starts from a weak monolingual detection of terms based on naturally annotated resources (e.g. Wikipedia), moves to a stronger joint bilingual detection of terms, and lets term detection and word alignment interact. Extensive experiments show that our method improves bilingual term detection by more than 8 absolute F-score points. Compared with a strong baseline (a well-tuned Moses system), it also improves term translation accuracy by more than 3.66% and sentence translation quality by a significant 0.38 absolute BLEU points.
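To make the term-detection metric concrete, the following minimal Python sketch computes precision, recall and F-score over sets of bilingual term pairs. It is not the evaluation script used in the paper: the function name `f_score`, the exact-match criterion for term pairs, and the toy data are all assumptions made for illustration.

```python
def f_score(predicted, gold):
    """Return (precision, recall, F1) for two collections of term pairs.

    A predicted pair counts as correct only if it matches a gold pair
    exactly (an assumption; the paper may use a different criterion).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: (source term, target term) pairs.
gold = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}
baseline = {("s1", "t1"), ("s2", "tX"), ("s5", "t5")}
joint = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s5", "t5")}

_, _, f_baseline = f_score(baseline, gold)
_, _, f_joint = f_score(joint, gold)

# An "absolute" gain is the plain difference between the two scores,
# expressed in points on a 0-100 scale (not a relative percentage).
absolute_gain_points = (f_joint - f_baseline) * 100
print(f"baseline F1 = {f_baseline:.4f}, joint F1 = {f_joint:.4f}")
print(f"absolute gain = {absolute_gain_points:.1f} points")
```

The toy numbers carry no meaning beyond showing how an absolute F-score difference is obtained; the gains reported above come from the paper's own experiments.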
Similar resources
HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alig...
Bilingual Multi-Word Term Tokenization for Chinese–Japanese Patent Translation
We propose to re-tokenize data with aligned bilingual multi-word terms to improve statistical machine translation (SMT) in technical domains. For that, we independently extract multi-word terms from the monolingual parts of the training data. Promising bilingual multi-word terms are then identified using the sampling-based alignment method by setting some threshold on translation probabilities....
An Efficient Framework to Extract Parallel Units from Comparable Data
Since the quality of statistical machine translation (SMT) is heavily dependent upon the size and quality of training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, the existing solutions are restricted to extract either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework to extract both ...
Extraction of Bilingual Technical Terms for Chinese-Japanese Patent Translation
The translation of patents or scientific papers is a key issue that should be helped by the use of statistical machine translation (SMT). In this paper, we propose a method to improve Chinese–Japanese patent SMT by premarking the training corpus with aligned bilingual multi-word terms. We automatically extract multi-word terms from monolingual corpora by combining statistical and linguistic fil...
Given Bilingual Terminology in Statistical Machine Translation: MWE-Sensitve Word Alignment and Hierarchical Pitman-Yor Process-Based Translation Model Smoothing
This paper considers a scenario in which we are given almost perfect knowledge about bilingual terminology in terms of a test corpus in Statistical Machine Translation (SMT). When the given terminology is part of a training corpus, one natural strategy in SMT is to use the trained translation model ignoring the given terminology. Then, two questions arise here. 1) Can a word aligner capture the gi...
Publication date: 2016